Abstract

When a program is loaded into memory for execution, the relative 
position of its basic blocks is crucial, since loading basic blocks that are 
unlikely to be executed first places them high in the instruction-memory 
hierarchy only to be dislodged as the execution goes on. In this paper 
we study the use of Bayesian networks as models of the input history of 
a program. The main point is the creation of a probabilistic model that 
persists as the program is run on different inputs and at each new input 
refines its own parameters in order to reflect the program's input history
more accurately. As the model is thus tuned, it causes basic blocks to be 
reordered so that, upon arrival of the next input for execution, loading 
the basic blocks into memory automatically takes into account the input 
history of the program. We report on extensive experiments, whose results 
demonstrate the efficacy of the overall approach in progressively lowering 
the execution times of a program on identical inputs placed randomly in a 
sequence of varied inputs. We provide results on selected SPEC CINT2000 
programs and also evaluate our approach as compared to the gcc level-3 
optimization and to Pettis-Hansen reordering.

Keywords: Instruction memory, Code-layout optimization, Bayesian
networks.
1 Introduction



It is a well-known fact that only a small fraction of a program's instructions 
is responsible for most of its running time. Coupled with the growing gap 
that exists between memory and processor performance [16], this has over the
years led to the search for code-layout techniques for optimizing the use of the 
memory system. The essential guiding principle in this search is that the first 
of a program's basic blocks to be loaded into memory should be precisely those 
that are most likely to be executed. 

The earliest efforts related to optimizing code layout concentrated on virtual- 
memory systems and aimed at producing code layouts that could reduce 
the number of page faults at runtime. The advent of TLBs and the introduction
of several cache levels in recent processors have both shifted the context
considerably and added new momentum to the search for efficient techniques.
Naturally, the focus of this search is invariably placed on the investigation of 
heuristic techniques, since the optimality of a code layout cannot in general be 
decided p. 

Notable contributions within this more recent context include some that tar- 
get the reduction of the instruction-cache miss rate W, "F, TT , or the reduction 
of cache pollution and bus traffic 5 , or yet the reduction of the program's 
running time mi 12 El. Most of these contributions involve instrumenting the
program for trace recording and the eventual construction of a profile on which 
the code-layout optimization technique operates. Some others, however, con- 
centrate solely on the development of new compile-time techniques or combine 
profiling with compilation strategies. 

Our interest in this paper is to investigate the construction of a dynamic 
probabilistic model of the inputs to a program. Specifically, as the program is 
run on an assorted sequence of inputs, we describe how a probabilistic model 
can be dynamically updated so that, at all times, it reflects the input history of 
the program and can as such be used to update the program's code layout for 
subsequent use. As in so many of the techniques mentioned above, this model
is built on the profile of the program's execution on each input. Unlike those 
techniques, however, the one we introduce is based on a model that persists from 
one execution of the program to the next while refining itself as information on 
each new execution comes in. What we ultimately seek is the improvement of 
running times with as much generality as possible (this includes, for example, 
independence from specific cache sizes, once again in contrast to some of the 
related approaches). 

The following is a high-level description of our overall strategy. Let P be 
the binary code of some program for the architecture at hand, and I, I' two
independent inputs to P. Suppose we have a means of instrumenting P so
that running it on input I yields an abstract model of this particular execution
which can be used to estimate the set of basic blocks of P that is most likely
to occur in future executions. If this is the case, then we can reorder the basic
blocks of P in such a way that, when input I' comes along for execution, the
basic blocks to occupy the highest levels in the instruction-memory hierarchy
are none other than some of those that were previously estimated to be the
most likely to occur. That estimate, of course, was based solely on input I,
so it may work rather poorly on the new input I'. However, the execution of
P on I' can itself be instrumented and the model resulting from this second 
execution can be combined with the previous one in the hope of a more general, 
less execution-dependent model to be used in a subsequent run of P. 

Our central premise in this paper is that such modeling of executions of P 
can indeed be achieved and used successfully toward progressively more efficient 
runs of P as it is applied to a stream of inputs. The model that we build of 
a particular execution of P is based on recording a trace of the execution as 
it goes through the basic blocks of P and then using the data in the trace to 
construct a Bayesian network |10l |31 E| ■ Combining this Bayesian network 
with another that records a history of all previous executions of P, and then 
solving the resulting Bayesian network for the most likely combination of basic 
blocks, is what gives us the prediction capability that allows for progressively 
more efficient executions. Sections 2 and 3 contain, respectively, the details of
our model-building and -updating methodology, and a summary of our overall
strategy. Section 4 contains the results of extensive experimentation on the
SPEC CINT2000 suite [T3].

One immediate difficulty with this approach is of course that it may take 
considerable effort for the refined predictive model to be obtained from an
execution: not only does instrumenting P slow it down significantly, but also
setting up the Bayesian network and solving it may be quite time-consuming.
The entire strategy would then seem to be wholly inappropriate for a real-world
environment, since any gain that might eventually be accrued would be totally
overshadowed by the cost to obtain it. But we envisage a different dynamics
for the successful application of our approach, one that only applies it to a
substream of the stream of inputs to P, in such a way that rearranged versions of
P only become available every so often, as opposed to becoming available right
after every new input is processed. We return to this point in the concluding
section.

2 The model 

We consider a sequence $I_1, I_2, \ldots$ of inputs to P, along with a corresponding
sequence of directed graphs $G_1, G_2, \ldots$, where for $i \ge 1$ graph $G_i$ represents the
recorded trace of executing P on input $I_i$. Each node of $G_i$ is a basic block of P
that is reached during that execution. A directed edge exists in $G_i$ from basic
block a to basic block b, denoted by $(a \to b)$, if during the execution b follows a
immediately at least once. In this case, the trace is complemented by a positive
count, denoted by $f_{ab}$, indicating the number of times this happens. Notice that
the node sets of $G_1, G_2, \ldots$ are not necessarily the same, even though they are
all subsets of the set of P's basic blocks.
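
As an illustration only (not the instrumentation used in the experiments), the construction of an edge-labeled trace graph can be sketched in Python as follows. The function name `build_trace_graph` and the example trace are hypothetical; the sketch assumes the trace is available as a plain sequence of basic-block identifiers.

```python
from collections import Counter

def build_trace_graph(trace):
    """Return (nodes, counts) for one execution trace.

    `trace` is the sequence of basic-block identifiers in execution order;
    counts[(a, b)] is the number of times b immediately follows a.
    """
    counts = Counter(zip(trace, trace[1:]))
    nodes = set(trace)
    return nodes, dict(counts)

# Hypothetical trace: entry block "A", a loop over "B" and "C", exit block "D".
nodes, f = build_trace_graph(["A", "B", "C", "B", "C", "B", "D"])
```

Only blocks that are actually reached appear as nodes, which is why the node sets of different runs may differ.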

Transforming each of these edge-labeled directed graphs into a Bayesian 
network is one of the crucial steps of modeling the executions of P that they 
stand for. Before describing how the transformation is achieved, we pause briefly
for a discussion of the basic principles of Bayesian networks. 

2.1 Bayesian-network basics 

A Bayesian network is a node-labeled acyclic directed graph whose nodes are 
random variables and whose directed edges represent the existence of direct 
causal influences. In other words, if X and Y are nodes, then the existence of 
the directed edge $(X \to Y)$ indicates that the value of X influences the value
of Y directly. We use $\Pi_X$ to denote the set of variables from which edges exist
directed toward X (the so-called parents of X in the Bayesian network). If
we let $\pi_X$ denote a joint value assignment to the variables in $\Pi_X$, then for
each possible $\pi_X$ the label that goes with node X to complete the definition
of the Bayesian network includes the conditional probability $p(x \mid \pi_X)$ that
X has value x given the values of X's parents appearing in $\pi_X$. In the case
of 0, 1-variables, we need $2^{|\Pi_X|}$ such probabilities (which may be problematic,
depending on the size of $\Pi_X$); if $\Pi_X = \emptyset$, then the single necessary probability
is known as the prior probability of X.

One facilitating assumption that is always made in the study of Bayesian 
networks is that conditioning the value of X on the values of the variables in 
$\Pi_X$ is the same as conditioning on the values of all the variables that cannot
be reached from X along directed paths. Given this assumption, and letting $\mathbf{x}$
denote a joint value assignment to all the variables in the Bayesian network, it
is simple to see that

$$p(\mathbf{x}) = \prod_{X \in \mathbf{X}} p(x \mid \pi_X), \tag{1}$$

where $\mathbf{X}$ is the set of all variables (the node set of the Bayesian network), x is
the value assigned to X in $\mathbf{x}$, and $\pi_X$ comprises the values assigned in $\mathbf{x}$ to the
variables in $\Pi_X$.
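
The factorization in (1) can be checked on a toy example. The Python sketch below is illustrative only: the three-node network, its probabilities, and the names `parents`, `p1`, and `joint` are all hypothetical.

```python
from itertools import product

# Hypothetical three-node network: X0 -> X2 <- X1, all variables binary.
parents = {"X0": [], "X1": [], "X2": ["X0", "X1"]}
# p1[node][parent values] = probability that node has value 1.
p1 = {
    "X0": {(): 0.6},
    "X1": {(): 0.3},
    "X2": {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.7, (1, 1): 0.9},
}

def joint(x):
    """p(x) as the product over nodes of p(x | pi_X), as in (1)."""
    prob = 1.0
    for node, pars in parents.items():
        pi = tuple(x[q] for q in pars)
        p_one = p1[node][pi]
        prob *= p_one if x[node] == 1 else 1.0 - p_one
    return prob

# Summing the joint over all 2^3 assignments should give 1.
total = sum(joint(dict(zip(parents, bits))) for bits in product((0, 1), repeat=3))
```

Note that the label of X2 already needs four entries for two parents, which is the exponential growth the text warns about.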

In the context of this paper, the key problem to be solved once the Bayesian 
network has been set up is the following. Let $\mathbf{E} \subseteq \mathbf{X}$ comprise variables whose
values are no longer uncertain but known with certainty. These are the so-called
evidences and the problem asks for the joint value assignment to the variables
in $\mathbf{X} \setminus \mathbf{E}$ that maximizes $p(\mathbf{x} \setminus \mathbf{e} \mid \mathbf{e})$, where $\mathbf{x} \setminus \mathbf{e}$ denotes one such joint value
assignment and $\mathbf{e}$ the evidences' values.

This and other similar problems are in general computationally intractable,
in the sense of NP-hardness, even though $p(\mathbf{x} \setminus \mathbf{e} \mid \mathbf{e})$ can be derived from (1)
rather straightforwardly. This inherent difficulty stems essentially from the
existence of multiple paths joining two nodes in the undirected graph that underlies
the Bayesian network, and also from the absence of a constant bound on the
sizes of the $\Pi_X$ sets.

There are several approximation schemes that can be used. The one we use
in this paper is based on recognizing first that $p(\mathbf{x} \setminus \mathbf{e} \mid \mathbf{e})$ is proportional, by a
normalizing constant, to the $p(\mathbf{x})$ of (1), and further that maximizing $p(\mathbf{x})$ over
the possibilities for $\mathbf{x} \setminus \mathbf{e}$ is equivalent to minimizing the function

$$-\sum_{X \in \mathbf{X}} \ln p(x \mid \pi_X) \tag{2}$$

over the same possibilities when the distribution in (1) is everywhere strictly
positive.

This minimization, in turn, can be achieved by a variation of stochastic
simulation that employs simulated annealing in an attempt at near-optimality.
If T is the temperature-like parameter of simulated annealing, then whenever
during the process variable X is to be updated, it is assigned value x with
probability proportional (by a normalizing constant) to

$$\prod_{Y \in N_X} p(y \mid \pi_Y)^{1/T}, \tag{3}$$

given x as the value of X and the current joint value assignment to some of the
other variables in $\mathbf{X}$. In (3), $N_X$ comprises X itself and its so-called children
(variables Y such that $(X \to Y)$ is an edge); indirectly, then, the probability
depends only on x, on the current values of X's parents and children, and also
on the current values of its children's other parents. These are all well-known
results and for details we refer the reader to the pertinent literature ,3,, ■ Our use 
of the technique in this paper is concentrated in Section 4, where the necessary
details are filled in. 

As one last remark, notice that approximation schemes like the one we just 
outlined do nothing to handle the potentially problematic sizes of the $\Pi_X$ sets as
far as storing labels that depend on such sets is concerned. The issue is crucial, 
though, and we return to it shortly. 

2.2 The execution model 

For $i \ge 1$, we model the execution of program P on input $I_i$ by a Bayesian
network denoted by $B_i$. Constructing $B_i$ involves transforming the edge-labeled
directed graph $G_i$ into the node-labeled, acyclic directed graph $B_i$. We describe
this transformation as a sequence of two steps. The first step transforms $G_i$ into
an acyclic directed graph $G'_i$ that already has the desired structure of $B_i$ but
still carries integer labels on its edges. The second step completes the
transformation into $B_i$ by computing its node labels (sets of conditional probabilities)
from the edge labels of $G'_i$.

The node set of $G'_i$ contains one random variable for each of the nodes of
$G_i$, that is, for each of the basic blocks of P that is executed when P is run
on input $I_i$. A variable may only have value 0 or 1, representing respectively
the event that the corresponding basic block is not or is executed in a run
of P. Notice that this in no way contradicts the fact that, by definition,
all basic blocks in $G_i$ are executed in the run of P to which it corresponds.
Building the probabilistic model $B_i$ is simply an intermediate step that seeks to
later integrate the contribution of this particular run into a model of the input
history of P. We let $X_a$ be the variable corresponding to basic block a.

The node sets of $G_i$ and $G'_i$ are thus so far in one-to-one correspondence with
each other. Their edge sets, on the other hand, cannot in general have the same
property, since $G'_i$ (having the same structure as $B_i$) must be acyclic, while $G_i$
is in general not so. The source of directed cycles in $G_i$ is of course the presence
of basic blocks whose last instruction is a branch instruction to implement a
loop in P. The edges of $G_i$ that correspond to such branches are precisely the
ones that get eliminated in order to generate $G'_i$, which is then acyclic.

The process of eliminating branch edges is very simple. First we identify the
only possible source node in $G_i$, i.e., the single node that has no edges incoming
to it. This node represents the first basic block that is executed when P is run
on $I_i$; since P is fixed, such a node is the same for all values of i. Once the
source node is identified, a depth-first search is conducted starting at it and
all the back edges it produces are eliminated. Note that, even though it can
be argued that these back edges correspond precisely to the branch edges that
implement loops in P, what matters is simply that the resulting directed graph
is acyclic.
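
The back edges in question can be found with a standard depth-first search. The following Python sketch is illustrative only; the function name `back_edges` and the adjacency-list representation `succ` (with one key per node) are assumptions of the sketch, not part of the method's description.

```python
def back_edges(succ, source):
    """Return the set of back edges found by a depth-first search from `source`.

    `succ` maps every node to a list of its successors. An edge (u, v) is a
    back edge if v is still on the DFS stack when the edge is examined.
    """
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {u: WHITE for u in succ}
    found = set()
    color[source] = GRAY
    stack = [(source, iter(succ[source]))]
    while stack:
        u, it = stack[-1]
        for v in it:
            if color[v] == GRAY:          # v is an ancestor of u: back edge
                found.add((u, v))
            elif color[v] == WHITE:       # tree edge: descend into v
                color[v] = GRAY
                stack.append((v, iter(succ[v])))
                break
        else:                             # all successors of u examined
            color[u] = BLACK
            stack.pop()
    return found

# Hypothetical graph with one loop: A -> B -> C -> B, and B -> D.
loop_graph = {"A": ["B"], "B": ["C", "D"], "C": ["B"], "D": []}
edges_to_sever = back_edges(loop_graph, "A")
```

Removing the returned edges leaves an acyclic graph, which is all the construction requires.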

But let us examine the process of eliminating a back edge from $G_i$ more
closely. By definition of the edge labels of $G_i$, summing up the labels on the
edges incoming to any of its nodes (with the exception of its single source and
its sinks, i.e., nodes with no outgoing edges) must yield the same value as summing
up the labels on that node's outgoing edges. If $(a \to b)$ is a back edge, severing
it disrupts this balance so that the resulting graph no longer conveys the same
information as $G_i$ regarding the relative frequencies with which basic blocks
are executed. What we do to solve this is to create two additional nodes (one
source and one sink), called a' and b', and two additional edges, $(a' \to b)$ and
$(a \to b')$, each receiving the same label, $f_{ab}$, of the edge being severed. Clearly,
the desired balance is thus maintained. The resulting node and edge sets are
shared by both $G'_i$ and $B_i$.
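
The balance-preserving replacement of a back edge can be sketched as follows. This Python fragment is an illustration under the assumption that basic blocks are named by strings and that edge counts are kept in a dictionary keyed by (tail, head) pairs; the names `sever` and `flow` are hypothetical.

```python
def sever(counts, a, b):
    """Replace back edge (a, b) by (a', b) and (a, b'), both labeled f_ab."""
    f_ab = counts.pop((a, b))
    counts[(a + "'", b)] = f_ab   # new source a' feeds b
    counts[(a, b + "'")] = f_ab   # a drains into new sink b'
    return counts

def flow(counts, node):
    """Return (incoming total, outgoing total) of the edge labels at `node`."""
    inc = sum(f for (u, v), f in counts.items() if v == node)
    out = sum(f for (u, v), f in counts.items() if u == node)
    return inc, out

# Hypothetical counts for a run A, B, C, B, C, B, D with back edge (C, B).
counts = {("A", "B"): 1, ("B", "C"): 2, ("C", "B"): 2, ("B", "D"): 1}
sever(counts, "C", "B")
```

After the call, incoming and outgoing totals still agree at both B and C, as the text requires.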

Let us now consider the second step in turning $G_i$ into $B_i$, that is, the step
whereby the edge labels of $G'_i$ are transformed into the node labels of $B_i$. A node
$X_a$ in $G'_i$ or $B_i$ has $|\Pi_{X_a}|$ parents, and for each of the $2^{|\Pi_{X_a}|}$ possible joint value
assignments to those parents, say the value assignment $\pi_{X_a}$, the conditional
probability $p(0 \mid \pi_{X_a})$ (or, equivalently, $p(1 \mid \pi_{X_a})$) must be provided as part
of the label of $X_a$. Evidently, requiring such an exponentially large number of
label components is impractical even for moderately complex instances of P and
some more efficient representation must be adopted.

Our choice on this issue has been to adopt the customary noisy-OR
assumption [13], whose core in our context is the following. Let $X_a$ be a node with at
least one parent, and let $X_{a_1}, \ldots, X_{a_\sigma}$ be its parents, with $\sigma = |\Pi_{X_a}|$. The
assumption is that whatever causes the event $X_a = 1$ to be unaffected by the event
$X_{a_k} = 1$ is independent from whatever else may cause $X_a = 1$ to be unaffected
by $X_{a_l} = 1$, where $X_{a_k}$ and $X_{a_l}$ are any two distinct parents of $X_a$. If we let
$\pi^k_{X_a}$ denote the joint value assignment to the parents of $X_a$ that sets $X_{a_k} = 1$
and all other parents to 0, then clearly the noisy-OR assumption amounts to

$$p(0 \mid \pi_{X_a}) = \prod_{\substack{k=1 \\ X_{a_k}=1}}^{\sigma} p(0 \mid \pi^k_{X_a}), \tag{4}$$

where the $X_{a_k} = 1$ condition indicates that the product ranges over the parents
of $X_a$ that have value 1 in $\pi_{X_a}$. By (4), only the $\sigma$ conditional probabilities
$p(0 \mid \pi^1_{X_a}), \ldots, p(0 \mid \pi^\sigma_{X_a})$ need to be specified: the remaining $2^{|\Pi_{X_a}|} - \sigma$ ones
can be easily computed as they become necessary.

Figure 1: The surroundings of variable $X_a$ that are relevant to (5).
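
The way (4) recovers an arbitrary conditional probability from the stored single-parent ones can be sketched in Python as follows. The function name `noisy_or_p0` and the numerical values are hypothetical and serve only as illustration.

```python
def noisy_or_p0(p0_single, parent_values):
    """p(0 | pi) under the noisy-OR assumption (4).

    p0_single[k] is the probability that the node is 0 when only parent k
    is 1; parent_values[k] is the value of parent k in the assignment pi.
    """
    prob = 1.0
    for p0_k, value in zip(p0_single, parent_values):
        if value == 1:          # only parents with value 1 contribute a factor
            prob *= p0_k
    return prob

# Three parents, the first two set to 1: the result is 0.2 * 0.5.
example = noisy_or_p0([0.2, 0.5, 0.9], (1, 1, 0))
```

Storage is thus linear in the number of parents, at the cost of one product per query.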

For $1 \le k \le \sigma$, the following is how we obtain the value of $p(0 \mid \pi^k_{X_a})$ for
use in $B_i$. Let $X_{b_1}, \ldots, X_{b_\tau}$ be the children of $X_{a_k}$ (including, of course, $X_a$).
We then let

$$p(0 \mid \pi^k_{X_a}) = 1 - \frac{f_{a_k a}}{\sum_{j=1}^{\tau} f_{a_k b_j}}. \tag{5}$$

In words, what (5) is saying is that $p(1 \mid \pi^k_{X_a})$ is, of all the times the execution
of P goes through basic block $a_k$, the fraction in which it proceeds directly to
basic block a. An illustration depicting the variables involved in this process is
given in Figure 1.
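
As a worked illustration of (5), the following Python sketch derives a single-parent probability from edge counts. The function name `p0_given_single_parent` and the counts are hypothetical; the counts dictionary is assumed to be keyed by (tail, head) pairs.

```python
def p0_given_single_parent(counts, a_k, a):
    """1 minus the fraction of the traffic leaving a_k that goes directly to a, as in (5)."""
    out_total = sum(f for (u, _), f in counts.items() if u == a_k)
    return 1.0 - counts[(a_k, a)] / out_total

# Hypothetical acyclic counts: B is left 3 times, twice toward C and once toward D.
acyclic_counts = {("A", "B"): 1, ("C'", "B"): 2, ("B", "C"): 2,
                  ("B", "D"): 1, ("C", "B'"): 2}
p0_c = p0_given_single_parent(acyclic_counts, "B", "C")   # 1 - 2/3
p0_d = p0_given_single_parent(acyclic_counts, "B", "D")   # 1 - 1/3
```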

All we are left to do is then to handle the case in which $X_a$ has no parent
and therefore needs a prior probability. This case includes the single source
inherited by $B_i$ from $G_i$ and all the ones we inserted artificially when severing
back edges during the transformation of $G_i$ into $G'_i$. However, as will become
apparent in Section 3, all such prior probabilities are irrelevant and we need not
worry about them. This is so because the source that represents the initial basic
block is an evidence (so its value never changes) and all the other sources are 
treated, during the solution of the Bayesian network, in a somewhat unorthodox 
way intended to ensure that their values are consistent with those of the sinks 
that were created along with them. 

2.3 The history model 

For $i > 1$ (that is, after P has been run at least once), the history model
of the first $i - 1$ executions is a Bayesian network, denoted by $H_{i-1}$, which
incorporates probabilistic knowledge on the occurrence of the basic blocks of
P as it is executed. We now describe how to incorporate the probabilistic
knowledge that $B_i$ embodies about the ith execution of P into $H_{i-1}$ so that
a new Bayesian network, $H_i$, now incorporating information on the first i executions,
can be obtained.

In order to achieve the combination of $H_{i-1}$ and $B_i$ into $H_i$, first we must
ensure that the two Bayesian networks we start with have the same node and
edge sets. This can be achieved simply by first determining the union of the
two node sets and the union of the two edge sets, and then enlarging each node
set to make it equal the union of the node sets, then similarly for the edge
sets. The only problem with this is that it leaves some node labels incomplete,
i.e., in both $H_{i-1}$ and $B_i$ there may be a non-source variable X with fewer than
$|\Pi_X|$ conditional probabilities specified for it. Each missing probability is a
probability that X = 0 given that a newly added parent has value 1 and all
others value 0. What we do in these cases is to set all missing probabilities to
a small value $\epsilon \in (0, 1)$.¹
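
The alignment of the two networks can be sketched in Python as follows. This is an illustration only: each network is assumed to be a dictionary mapping every node (sources included, with an empty entry) to its stored single-parent probabilities, and the name `align` and the value chosen for `EPSILON` are hypothetical.

```python
EPSILON = 0.01  # hypothetical choice; the text only requires a value in (0, 1)

def align(h_prev, b_new, eps=EPSILON):
    """Give two noisy-OR networks the same nodes and edges.

    Each network maps a node to {parent: p(0 | only that parent is 1)}.
    Missing nodes are added, and missing parent entries are set to eps.
    """
    nodes = set(h_prev) | set(b_new)
    aligned = []
    for net in (h_prev, b_new):
        out = {}
        for x in nodes:
            pars = set(h_prev.get(x, {})) | set(b_new.get(x, {}))
            out[x] = {k: net.get(x, {}).get(k, eps) for k in pars}
        aligned.append(out)
    return aligned[0], aligned[1]

# Hypothetical history model and execution model with different edges into C.
h = {"A": {}, "B": {"A": 0.4}, "C": {"B": 0.5}}
b = {"A": {}, "C": {"A": 0.1}}
h2, b2 = align(h, b)
```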

We may then henceforth assume that $H_{i-1}$ and $B_i$ have the same node and
edge sets, and also that they have labels completely specified within the noisy-OR
assumption for all non-source nodes. These shared node and edge sets are also
the node and edge sets of the resulting Bayesian network, $H_i$. Let $\mathbf{X}$ denote this
common node set and $\mathbf{x}$ stand for a joint value assignment to all the variables
in $\mathbf{X}$.

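We would like, ideally, to obtain the node labels of $H_i$ in such a way as to
ensure that the resulting joint distribution over $\mathbf{X}$ is the (normalized)
geometric average of the two source joint distributions, i.e., those that correspond
to $H_{i-1}$ and $B_i$. The geometric average of two distributions seems only natural
in the Bayesian-network context, since it involves products of probabilities and
such products already lie at the core of any analysis of Bayesian networks (cf.
(1)). So if $p_i$ is the probability distribution for $B_i$ and $q_i$ the distribution for
$H_i$, then we would aim at having, with $\alpha_i \in (0, 1)$ and $n_i = |\mathbf{X}|$,

$$q_i(\mathbf{x}) = \frac{q_{i-1}(\mathbf{x})^{1-\alpha_i}\, p_i(\mathbf{x})^{\alpha_i}}{\sum_{\mathbf{x}' \in \{0,1\}^{n_i}} q_{i-1}(\mathbf{x}')^{1-\alpha_i}\, p_i(\mathbf{x}')^{\alpha_i}} \tag{6}$$

for all $\mathbf{x} \in \{0, 1\}^{n_i}$. And in fact it is easy to demonstrate that (6) is achieved if
it is also achieved at the node-label level, that is, if

$$q_i(x \mid \pi_X) = \frac{q_{i-1}(x \mid \pi_X)^{1-\alpha_i}\, p_i(x \mid \pi_X)^{\alpha_i}}{\sum_{x' \in \{0,1\}} q_{i-1}(x' \mid \pi_X)^{1-\alpha_i}\, p_i(x' \mid \pi_X)^{\alpha_i}} \tag{7}$$

for all $X \in \mathbf{X}$, all $x \in \{0, 1\}$, and all $\pi_X \in \{0, 1\}^{|\Pi_X|}$.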

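Let us digress briefly to outline the main argument of this demonstration.
If we assume that (7) holds as stated, then we obtain, for all $\mathbf{x} \in \{0, 1\}^{n_i}$ and
starting with an application of (1),

$$q_i(\mathbf{x}) = \prod_{X \in \mathbf{X}} q_i(x \mid \pi_X) = \frac{\prod_{X \in \mathbf{X}} q_{i-1}(x \mid \pi_X)^{1-\alpha_i}\, p_i(x \mid \pi_X)^{\alpha_i}}{\prod_{X \in \mathbf{X}} \sum_{x' \in \{0,1\}} q_{i-1}(x' \mid \pi_X)^{1-\alpha_i}\, p_i(x' \mid \pi_X)^{\alpha_i}}. \tag{8}$$

Rewriting the denominator yields

\begin{align}
& \prod_{X \in \mathbf{X}} \sum_{x' \in \{0,1\}} q_{i-1}(x' \mid \pi_X)^{1-\alpha_i}\, p_i(x' \mid \pi_X)^{\alpha_i} \tag{9} \\
& \qquad = \sum_{\mathbf{x}' \in \{0,1\}^{n_i}} \prod_{X \in \mathbf{X}} q_{i-1}(x' \mid \pi_X)^{1-\alpha_i}\, p_i(x' \mid \pi_X)^{\alpha_i}, \tag{10}
\end{align}

where $x'$ is the value of X in $\mathbf{x}'$. By (1), this leads to (6).

¹We note that it is critical that $\epsilon$ be a strictly positive value. Setting such probabilities
to 0 disrupts the fundamental nature of a Bayesian network as a Markov (or, equivalently,
a Gibbs) random field, in which case all the theory that underlies the optimization process
summarized by (3) crumbles [?].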

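The problem is that the $q_i(x \mid \pi_X)$ of (7) is in general not compliant with
the noisy-OR assumption we made in Section 2.2, even if $q_{i-1}(x \mid \pi_X)$ and
$p_i(x \mid \pi_X)$ are. In order to see this, we assume the latter and rewrite (7) using
the notation of Section 2.2 with $\sigma = |\Pi_X|$ (the parents of X being $X_1, \ldots, X_\sigma$);
by (4), we get, for instance for x = 0,

$$q_i(0 \mid \pi_X) = \frac{\prod_{\substack{k=1 \\ X_k=1}}^{\sigma} q_{i-1}(0 \mid \pi^k_X)^{1-\alpha_i}\, p_i(0 \mid \pi^k_X)^{\alpha_i}}{\sum_{x' \in \{0,1\}} q_{i-1}(x' \mid \pi_X)^{1-\alpha_i}\, p_i(x' \mid \pi_X)^{\alpha_i}}, \tag{11}$$

where the denominator can also be rewritten:

$$\sum_{x' \in \{0,1\}} q_{i-1}(x' \mid \pi_X)^{1-\alpha_i}\, p_i(x' \mid \pi_X)^{\alpha_i} = \prod_{\substack{k=1 \\ X_k=1}}^{\sigma} q_{i-1}(0 \mid \pi^k_X)^{1-\alpha_i}\, p_i(0 \mid \pi^k_X)^{\alpha_i} + \Biggl(1 - \prod_{\substack{k=1 \\ X_k=1}}^{\sigma} q_{i-1}(0 \mid \pi^k_X)\Biggr)^{1-\alpha_i} \Biggl(1 - \prod_{\substack{k=1 \\ X_k=1}}^{\sigma} p_i(0 \mid \pi^k_X)\Biggr)^{\alpha_i}. \tag{12}$$

Clearly, for noisy-OR compliance we should have

$$q_i(0 \mid \pi_X) = \prod_{\substack{k=1 \\ X_k=1}}^{\sigma} \frac{q_{i-1}(0 \mid \pi^k_X)^{1-\alpha_i}\, p_i(0 \mid \pi^k_X)^{\alpha_i}}{\sum_{x' \in \{0,1\}} q_{i-1}(x' \mid \pi^k_X)^{1-\alpha_i}\, p_i(x' \mid \pi^k_X)^{\alpha_i}}, \tag{13}$$

but the two denominators are not in general equal.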
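
A small numerical check makes the discrepancy concrete. In the Python sketch below, which is illustrative only, a node has two parents that both have value 1; the geometric average is applied once to the full conditional probabilities and once parent by parent, as noisy-OR compliance would require. The function name `geo` and all probability values are hypothetical.

```python
def geo(q, p, alpha):
    """Normalized geometric average of two probabilities of value 0."""
    num = q ** (1 - alpha) * p ** alpha
    return num / (num + (1 - q) ** (1 - alpha) * (1 - p) ** alpha)

alpha = 0.5
q_single = [0.5, 0.5]   # hypothetical history-model probabilities, one per parent
p_single = [0.2, 0.8]   # hypothetical execution-model probabilities, one per parent

# With both parents at 1, noisy-OR gives each source model its own p(0 | pi).
q_both = q_single[0] * q_single[1]
p_both = p_single[0] * p_single[1]

# Averaging the full conditionals versus averaging per parent and multiplying.
exact = geo(q_both, p_both, alpha)
noisy = geo(q_single[0], p_single[0], alpha) * geo(q_single[1], p_single[1], alpha)
```

Here `exact` is about 0.201 while `noisy` equals 2/9, so the two rules disagree even on this tiny case.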

The inescapable conclusion is then that we must choose between the concise
node-label representations afforded by the noisy-OR assumption and achieving
(6) through (7). Given our application domain, in which variables with hundreds
of parents do occur, the ability to represent node labels parsimoniously is
absolutely essential. We then choose the first option while the second one remains
an ideal to be approximated. Having opted for conciseness, it then suffices to
apply the geometric-average rule of (7) to the $|\Pi_X|$ conditional probabilities of
each and every variable $X \in \mathbf{X}$ that is not a source (prior probabilities, recall,
are not needed in our context).

Figure 2: Contour plots of $r = q^{1-\alpha} p^{\alpha} / (q^{1-\alpha} p^{\alpha} + (1-q)^{1-\alpha} (1-p)^{\alpha})$ with
$0 \le p, q \le 1$ for $\alpha = 0.2$ (a) and $\alpha = 0.8$ (b). As $\alpha$ is increased from 0.2 in (a)
to 0.8 in (b), r becomes more sensitive to the value of p than to the value of q.

It now remains for us to find a suitable value for $\alpha_i$. The strategy we use is
based on the following general premise. Suppose we can devise an ideal value,
call it $\alpha_0$, for the mixture of the two Bayesian networks. This can be done,
for instance, by running P a number of times on a randomly chosen sequence
of inputs, each time with a different candidate value for $\alpha_0$, and at the end
selecting the value that yields the smallest overall running time. The chosen
$\alpha_0$ can then be used as a sort of threshold: after P is run on $I_i$ and $B_i$ is
obtained, we check its running time against some average of the running times
of P on the previous $i - 1$ inputs; if smaller we select a value for $\alpha_i$ that is
smaller than $\alpha_0$, and correspondingly if it is larger. What this is doing, since
we are dealing with geometric averages of numbers below 1, is to let executions
with comparatively larger running times weigh more in the history model than
executions with comparatively smaller running times (essentially, the probability
that gets raised to the smallest exponent yields, if large enough, the result that
is nearest 1 and therefore affects the geometric average the least; cf. Figure 2
for a clarifying illustration).

Now for the details. Let $t_1, t_2, \ldots$ be the running times of P on $I_1, I_2, \ldots$,
respectively. Let $T_{i-1}$ be the average, weighted by normalized versions of
$\alpha_1, \ldots, \alpha_{i-1}$, of the first $i - 1$ running times, that is,

$$T_{i-1} = \frac{\alpha_1 t_1 + \cdots + \alpha_{i-1} t_{i-1}}{\alpha_1 + \cdots + \alpha_{i-1}}. \tag{14}$$

The value we use for $\alpha_i$ is then

$$\alpha_i = \left(1 + e^{-\gamma (t_i - T_{i-1})}\, \frac{1 - \alpha_0}{\alpha_0}\right)^{-1}, \tag{15}$$

where $\gamma > 0$ is a parameter. The functional dependency in (15) has a sigmoidal
form and maps $t_i$ into the interval [0, 1]. It yields $\alpha_i = \alpha_0$ when $t_i = T_{i-1}$, that
is, when P runs on the ith input as fast as it has on average; smaller values of $t_i$
bring $\alpha_i$ closer to 0, larger values bring it closer to 1. The function's steepness
around $t_i = T_{i-1}$ is controlled by the $\gamma$ parameter. Illustrations with $\gamma = 1$ are
given in Figure 3.

Figure 3: Plots of $\alpha = (1 + e^{-\gamma (t - T)} (1 - \alpha_0)/\alpha_0)^{-1}$ with T = 6 for $\alpha_0 = 0.2, 0.5, 0.8$.
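
The behavior of the mixing weight can be explored with the Python sketch below. It is an illustration under a stated assumption: the sigmoid is taken to have exponent $-\gamma (t_i - T_{i-1})$, which satisfies the properties listed in the text (value $\alpha_0$ at $t_i = T_{i-1}$, steepness governed by $\gamma$) but is not guaranteed to be the exact form used in the experiments. The function names are hypothetical.

```python
import math

def alpha_i(t_i, t_avg, alpha_0, gamma=1.0):
    """Sigmoidal mixing weight that equals alpha_0 when t_i == t_avg.

    The exponent -gamma * (t_i - t_avg) is an assumption of this sketch.
    """
    odds = (1.0 - alpha_0) / alpha_0
    return 1.0 / (1.0 + math.exp(-gamma * (t_i - t_avg)) * odds)

def weighted_average(alphas, times):
    """Average of the running times weighted by the alphas, as in (14)."""
    return sum(a * t for a, t in zip(alphas, times)) / sum(alphas)

# A run slower than average gets more weight than alpha_0, a faster one less.
slow = alpha_i(12.0, 6.0, 0.5)
fast = alpha_i(2.0, 6.0, 0.5)
```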

All of our discussion concerning the evolution of the history model holds, of
course, for $i > 1$. For $i = 1$, no previous history model exists and $B_1$ simply
becomes $H_1$ while we let $T_1 = t_1$. In addition, determining $\alpha_1$ for later use
requires that $T_0$ be known as well; there are various possibilities, one being to
take $T_0$ as the average running time of P during the initial experiments that
yield the value of $\alpha_0$.

3 The overall strategy 

The following is a summary of our strategy in this paper. It provides the
main steps to be followed as the program P is run, in succession, on the inputs
$I_1, I_2, \ldots$. We assume that running P on $I_i$ for $i \ge 1$ automatically produces the
Bayesian network $B_i$ as explained in Section 2.2 and also yields the measure $t_i$
for the running time of P on $I_i$. This running time is assumed to exclude all the
instrumentation effort that creates $B_i$. We also assume that the value of $\alpha_0$ is
known from previous experimentation with P on a random assortment of inputs,
and furthermore that the average running time of P during the experiments is
recorded as $T_0$.

1. Run P on $I_1$, then do:

(a) Determine $\alpha_1$ from (15).

(b) Let $T_1 = t_1$.

(c) Duplicate $B_1$ to yield $H_1$.

(d) Solve $H_1$ for the most likely joint occurrence of basic blocks, then
reorder the basic blocks of P accordingly.

2. For $i > 1$, run P on $I_i$, then do:

(a) Determine $\alpha_i$ from (15).

(b) Let

$$T_i = \frac{\alpha_1 t_1 + \cdots + \alpha_i t_i}{\alpha_1 + \cdots + \alpha_i}.$$

(c) Obtain the node and edge sets of $H_i$ as the union, respectively, of the
node and edge sets of $H_{i-1}$ and $B_i$. Let $\mathbf{X}$ be the node set of $H_i$.

(d) For $X \in \mathbf{X}$, do the following if X is not a source. Let $\sigma = |\Pi_X|$
and $\Pi_X = \{X_1, \ldots, X_\sigma\}$. Let also $\pi^k_X$, with $1 \le k \le \sigma$, be the joint
value assignment to the variables in $\Pi_X$ that assigns 0 to all variables
except $X_k$, which receives value 1. For $k = 1, \ldots, \sigma$, let

$$q_i(0 \mid \pi^k_X) = \frac{q_{i-1}(0 \mid \pi^k_X)^{1-\alpha_i}\, p_i(0 \mid \pi^k_X)^{\alpha_i}}{Z},$$

where

$$Z = q_{i-1}(0 \mid \pi^k_X)^{1-\alpha_i}\, p_i(0 \mid \pi^k_X)^{\alpha_i} + (1 - q_{i-1}(0 \mid \pi^k_X))^{1-\alpha_i}\, (1 - p_i(0 \mid \pi^k_X))^{\alpha_i}.$$

If either $q_{i-1}(0 \mid \pi^k_X)$ or $p_i(0 \mid \pi^k_X)$ is missing (because $X_k$ is not
a parent of X in both $H_{i-1}$ and $B_i$), then assume a small value
$\epsilon \in (0, 1)$ for it.

(e) Solve $H_i$ for the most likely joint occurrence of basic blocks, then
reorder the basic blocks of P accordingly.
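
The label update of Step 2(d) can be sketched in Python as follows. The sketch is illustrative and not the code used in the experiments; it assumes that the stored single-parent probabilities of each model are kept in a dictionary keyed by (node, parent) pairs, and the function name `merge_labels` and the default value of `eps` are hypothetical.

```python
def merge_labels(q_prev, p_new, alpha, eps=0.01):
    """Combine single-parent probabilities of value 0 as in Step 2(d).

    q_prev and p_new map (node, parent) to the probability that the node is 0
    when only that parent is 1, in the history and execution models
    respectively; a pair missing from one of them is given the value eps.
    """
    merged = {}
    for key in set(q_prev) | set(p_new):
        q = q_prev.get(key, eps)
        p = p_new.get(key, eps)
        num = q ** (1 - alpha) * p ** alpha
        z = num + (1 - q) ** (1 - alpha) * (1 - p) ** alpha
        merged[key] = num / z
    return merged

# Hypothetical labels: one shared edge (B -> C) and one edge new in this run (A -> C).
h_labels = {("C", "B"): 0.5}
b_labels = {("C", "B"): 0.2, ("C", "A"): 0.1}
new_labels = merge_labels(h_labels, b_labels, alpha=0.5)
```

With the weight at 0 the history model is returned unchanged, and with the weight at 1 the execution model replaces it, which matches the intended role of the mixing weight.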

Solving the history models in Steps 1(d) and 2(e) can be achieved, for example,
by the variation of stochastic simulation mentioned in Section 2.1. During
the simulation, the variable that corresponds to the initial basic block is treated
as an evidence, that is, its value remains fixed at 1 at all times.

All other variables have their values updated regularly according to the
probability prescribed in (3), but the following special precaution is taken when
updating the source-sink pairs of variables created as back edges are severed during
the construction of the execution model. If $X_{a'}$ and $X_{b'}$ are, respectively, such
a source and sink, then $X_{a'}$ is never updated directly but rather has its value
copied from that of $X_{b'}$ whenever $X_{b'}$ is updated. This is intended to ensure the
semantic consistency that the creation of the two variables implicitly suggests
as desirable.
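
A compact Python sketch of such an annealed stochastic simulation is given below. It is an illustration under several assumptions that are not taken from the text: the cooling schedule (`t0`, `cooling`, `t_min`), the number of sweeps, the clamp `DELTA` that keeps probabilities strictly positive, and all function and parameter names are hypothetical. Sources contribute a constant factor, since their priors are not needed.

```python
import math
import random

DELTA = 1e-6  # assumed clamp keeping every probability strictly positive

def p_value(net, state, y, value):
    """Probability that y has `value` given its parents in `state` (noisy-OR)."""
    if not net[y]:
        return 1.0                      # sources: prior treated as a constant
    p0 = 1.0
    for parent, p0_single in net[y].items():
        if state[parent] == 1:
            p0 *= p0_single
    p0 = min(max(p0, DELTA), 1.0 - DELTA)
    return p0 if value == 0 else 1.0 - p0

def anneal(net, evidence, copies, sweeps=200, t0=2.0, cooling=0.98,
           t_min=0.05, seed=0):
    """Approximate the most likely assignment by annealed stochastic simulation.

    net[y] maps each parent of y to its stored single-parent probability;
    evidence holds fixed values; copies maps a sink b' to its paired source a'.
    """
    rng = random.Random(seed)
    children = {x: [y for y in net if x in net[y]] for x in net}
    state = {x: evidence.get(x, rng.randint(0, 1)) for x in net}
    for sink, source in copies.items():
        state[source] = state[sink]
    frozen = set(evidence) | set(copies.values())
    temp = t0
    for _ in range(sweeps):
        for x in net:
            if x in frozen:
                continue                # evidences and paired sources are not sampled
            logs = []
            for value in (0, 1):
                state[x] = value
                logs.append(sum(math.log(p_value(net, state, y, state[y]))
                                for y in [x] + children[x]))
            top = max(logs)
            w0, w1 = (math.exp((lg - top) / temp) for lg in logs)
            state[x] = 1 if rng.random() * (w0 + w1) < w1 else 0
            if x in copies:
                state[copies[x]] = state[x]   # source copies its paired sink
        temp = max(temp * cooling, t_min)
    return state

# Hypothetical network: initial block A, sink S paired with source R, block C.
net = {"A": {}, "S": {"A": 0.3}, "R": {}, "C": {"R": 0.4}}
solution = anneal(net, evidence={"A": 1}, copies={"S": "R"})
```

The paired source is excluded from sampling and only mirrors its sink, which is the precaution described above.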

The reordering of P's basic blocks in the same two Steps 1(d) and 2(e)
involves examining all the variables that have value 1 in the global joint value
assignment obtained as solution of the history model. This assignment is an
approximation of an $\mathbf{x}^*$ for which $q_i(\mathbf{x}^*)$ is maximum given the evidence
corresponding to the execution of the initial basic block. In $\mathbf{x}^*$, we expect variables
with value 1 to constitute a unique directed path in the history model whose
first variable corresponds to the initial basic block, provided we allow jumps
between variables like the $X_{b'}$ and $X_{a'}$ above to be included in the path. As
we discuss in Section 4, this expectation has in practice been verified for the
approximations of $\mathbf{x}^*$ that we obtain as well. Reordering the basic blocks is then
simply a matter of placing the basic blocks of this directed path in a position
inside P that ensures they are the first to be loaded into memory for execution
on the next input.

Note, finally, that node labels for both the execution models, via (5), and the
history models, via Step 2(d), are stored concisely according to the noisy-OR
assumption. By (4), all the conditional probabilities not explicitly stored may
be obtained readily when needed during the simulation.

4 Experimental results 

We have conducted extensive experiments to assess the performance of the
strategy summarized in Steps 1 and 2 of Section 3, henceforth referred to as the
Bayesian-network approach. Our goal has been twofold: first, to verify the
approach's ability to provide better running times as a program is repeatedly
run on the same input, possibly with the intervention of other inputs; secondly,
to compare the running times under the Bayesian-network approach with those
obtained under the gcc level-3 optimization (with no further code reordering) or
Pettis-Hansen reordering (as implemented in Plto [?]). In the remainder
of the paper, we refer to the latter two strategies concisely by the epithets O3
and PH, respectively.

The PH strategy can be viewed as operating precisely on the Gi graph of
Section 3. In essence, what it does is to repeatedly concatenate basic-block
chains greedily based on the counts that label the graph's edges. To this end, it
first lets every basic block be a chain, and then proceeds by examining the edges
that connect the end of a chain to the beginning of another and selecting the
one that has the greatest label to join the two chains at its ends. When chains
can no longer be joined, they are placed in a relative order that favors the most
frequently taken branches. The program's basic blocks are then reordered
accordingly. Plto, including as it does the functionality to do basic-block
reordering from edge-labeled graphs like Gi, provides a convenient framework for
implementing not only the PH strategy (which it does by default) but also the
basic-block reordering prescribed by our Bayesian-network approach (which we
lead it to achieve, as discussed below).
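The greedy chaining just described can be sketched as follows. This is a simplified illustration, not Plto's implementation: the name `ph_chains`, the dictionary layout of the edge counts, and the decision to ignore zero-count edges are assumptions, and the final ordering of the finished chains is left out.

```python
def ph_chains(blocks, edge_counts):
    """Greedily join basic-block chains along the heaviest usable edges.

    blocks      -- iterable of basic-block identifiers
    edge_counts -- dict mapping (src, dst) to an execution count
    Returns the list of chains, each a list of blocks in layout order.
    """
    blocks = list(blocks)
    chain_of = {b: [b] for b in blocks}  # initially every block is a chain
    # One pass from heaviest to lightest suffices: an edge that cannot join
    # two chains now can never do so later, because chains only grow.
    ordered = sorted(edge_counts.items(), key=lambda item: item[1], reverse=True)
    for (src, dst), count in ordered:
        if count <= 0:
            break  # assumption: edges never taken do not drive a merge
        head, tail = chain_of[src], chain_of[dst]
        # src must end one chain and dst must begin a different one.
        if head is not tail and head[-1] == src and tail[0] == dst:
            head.extend(tail)
            for blk in tail:
                chain_of[blk] = head
    chains, seen = [], set()
    for b in blocks:
        chain = chain_of[b]
        if id(chain) not in seen:
            seen.add(id(chain))
            chains.append(chain)
    return chains
```

For example, with counts `{("A", "B"): 10, ("A", "C"): 3, ("B", "D"): 8, ("C", "D"): 5}` the blocks A, B, and D end up in one chain and C remains on its own.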

Table 1: Input distribution.

Program   Reference   Train   Test   Reduced
bzip2     0-2         3       4      5-7
crafty    0           1       2      3-5
gap       0           1       2      3-5
gcc       0-4         5       6      7, 8
gzip      0-4         5       6      7-21
mcf       0           1       2      3, 4
parser    0           1       2      3-5
twolf     0           1       2      3, 4
vortex    0-2         3       4      5-7
vpr       0, 1        2, 3    4, 5   6-9

In addition to this use of Plto to achieve the reordering of basic blocks,
we also use it as part of the procedure to generate the graph Gi, as it already
implements a considerable portion of the profiling functionality that is necessary
to build that graph. However, Plto does this profiling separately for each
procedure and does not provide the frequency counts that correspond to returns
from executing procedures. But in Gi every node and edge must be properly
placed (and, in the case of edges, labeled), so in essence we make the following
addition to the processing of Plto. Let a be a basic block through which a
certain procedure is called, and let b be the basic block that follows after the
procedure is executed. Plto provides the edge labels inside the procedure's
code but links a directly to b along with the label f_ab. But what we need in
Gi, instead, is an edge directed from a to the procedure's entry basic block, and
also an edge directed from the procedure's exit basic block to b. We then create
each of these edges with label f_ab and eliminate the edge (a -> b).
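The call/return adjustment can be illustrated with a minimal sketch over a dictionary of labeled edges. The function name `split_call_edge`, the edge representation, and the choice to add counts to any edge that already exists (for example when several call sites share a callee) are assumptions made for the illustration only.

```python
def split_call_edge(edges, a, b, entry, exit_block):
    """Replace the call-site edge (a, b) by (a, entry) and (exit_block, b).

    edges -- dict mapping (src, dst) to a count; a new dict is returned
    a, b  -- calling block and the block that follows the call
    entry, exit_block -- entry and exit basic blocks of the called procedure
    """
    new_edges = dict(edges)
    f_ab = new_edges.pop((a, b))  # label that linked a directly to b
    # Assumption: accumulate rather than overwrite existing labels.
    new_edges[(a, entry)] = new_edges.get((a, entry), 0) + f_ab
    new_edges[(exit_block, b)] = new_edges.get((exit_block, b), 0) + f_ab
    return new_edges
```

Applied to `{("a", "b"): 7, ("p_in", "p_out"): 7}` with entry `"p_in"` and exit `"p_out"`, the direct edge disappears and both new edges carry the label 7.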

We concentrated our experiments on the SPEC programs listed in Table 1.^1
For each program, the inputs that appear in the suite's Reference, Train, Test,
and Reduced sets are numbered sequentially from 0 as indicated in the table.
Within each of the four sets, our numbering follows the same order as used for
the suite's files.

We used for all our experiments an AMD Athlon running at 1.66 GHz with a 
256-Kbyte level-2 cache and 256 Mbytes of main memory. We used the RedHat 
7.3 Linux operating system (kernel version 2.4.18-3) and version 2.96 of gcc 
(always with the level-3 optimization option). Every running time we report is 
expressed in seconds and refers, for each program on each input, to the middle 
time of three runs (i.e., the one that remains after discarding the lowest and 
highest times). 
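The "middle time of three runs" rule amounts to taking the median of three measurements. A minimal sketch follows; the function name `middle_time` is hypothetical.

```python
def middle_time(run_times):
    """Return the middle of exactly three running times (in seconds)."""
    if len(run_times) != 3:
        raise ValueError("expected exactly three run times")
    # Sorting and taking the central element discards the lowest and highest.
    return sorted(run_times)[1]
```

For three made-up measurements such as `[12.4, 11.9, 12.1]`, the reported time would be 12.1.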

Let $n_P$ denote the number of distinct inputs to a program P from the SPEC
suite (in Table 1, inputs are then numbered from 0 through $n_P - 1$). Our
methodology for experimentation on P has been to apply the Bayesian-network
approach to the sequence $I_1, \ldots, I_{6n_P}$ of inputs to P generated
randomly in such a way that each of the $n_P$ inputs appears exactly six times
in the sequence. In order to apply the Bayesian-network approach to this
sequence via Steps 1 and 2 of Section 2, we first chose the values of
$\alpha_0$ and $T_0$ as indicated in that section,^2 and then proceeded to one
of the two steps on each of the $6n_P$ inputs with $\gamma = 1$.

^1 We have omitted eon and perlbmk from our experiments because there seems to
be some incompatibility with the Plto version that is current as we write. But
the conclusions we draw in the sequel appear to be well supported by the
programs we did consider, so we believe these two omissions to be essentially
harmless.
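A random input sequence in which every input appears exactly six times can be produced, for instance, by shuffling a multiset. The sketch below assumes a uniform shuffle, which the text does not specify; the name `make_input_sequence` and the `seed` parameter are hypothetical.

```python
import random


def make_input_sequence(n_inputs, repeats=6, seed=None):
    """Return a random sequence in which each input 0..n_inputs-1 occurs `repeats` times."""
    rng = random.Random(seed)  # local generator, so results are reproducible per seed
    sequence = [i for i in range(n_inputs) for _ in range(repeats)]
    rng.shuffle(sequence)      # assumption: all orderings equally likely
    return sequence
```

For bzip2, which has eight inputs in Table 1, `make_input_sequence(8)` yields a sequence of 48 input indices.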

Whenever solving a history model by stochastic simulation, we let T vary from
$T = 10^4$ down to the first value below 1 that could be obtained by letting
each new value for T be 98% of its predecessor. It then follows that T went
through $\lceil 1 - \ln 10^4 / \ln 0.98 \rceil = 457$ continually decreasing
values. For each value, each non-source variable of the model was updated
exactly once according to the prescribed probability. At the end, searching
through the variables for those with value 1 consistently yielded, on all
occasions, the desired directed path from the initial basic block described in
Section 3.
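The count of 457 temperature values can be checked with a short sketch. It assumes a starting value of 10^4, which is the power of ten consistent with 457 steps at a 98% decay; the function name `annealing_schedule` is hypothetical.

```python
import math


def annealing_schedule(t_start=1e4, factor=0.98):
    """List the temperatures from t_start down to the first value below 1."""
    temps = [t_start]
    while temps[-1] >= 1.0:
        temps.append(temps[-1] * factor)  # each value is 98% of its predecessor
    return temps


temps = annealing_schedule()
# Closed form for the same count: ceil(1 - ln(10^4) / ln(0.98)).
closed_form = math.ceil(1 - math.log(1e4) / math.log(0.98))
```

Both `len(temps)` and `closed_form` evaluate to 457, and only the last temperature in the list lies below 1.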

In our experiments, this is where our use of Plto's functionality once again
comes in. Specifically, we revert to the Gi graph and modify its edge labels as
follows before feeding it to Plto. First, edges that do not appear in the directed
path that solves the history model have their labels set to 0. Then we scan the
edges of that directed path, starting at the initial basic block, and change all
edge labels by assigning a large label (at least n) to the first edge encountered,
this first label minus 1 to the second edge, and so on. Indirectly, this necessarily
leads Plto's default strategy (the PH strategy) to reorder the basic blocks as
desired.
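The relabeling that steers the PH strategy toward the solved path can be sketched as below. The function name `relabel_for_path` is hypothetical, and taking the largest label as the maximum of n and the path length is an assumption that merely keeps every path label positive.

```python
def relabel_for_path(edges, path, n):
    """Return new edge labels that make a greedy chainer follow `path`.

    edges -- iterable of (src, dst) pairs of the graph
    path  -- list of basic blocks, starting at the initial block
    n     -- lower bound for the largest label (assumed: the number of nodes)
    """
    labels = {edge: 0 for edge in edges}   # off-path edges get label 0
    path_edges = list(zip(path, path[1:]))
    top = max(n, len(path_edges))          # assumption: keeps labels >= 1
    for k, edge in enumerate(path_edges):
        if edge not in labels:
            raise ValueError(f"path edge {edge} is not in the graph")
        labels[edge] = top - k             # strictly decreasing along the path
    return labels
```

Because the labels decrease strictly along the path and vanish elsewhere, a greedy chaining by decreasing label joins the path's blocks in path order.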

Our results on the programs of Table 1 are summarized in the plots of
Figure 4. For each of the ten programs, the figure contains two sets of plots
displayed side by side. The plots on the left refer to inputs on which the program
is considerably slower than on those whose plots appear on the right (with only
two exceptions, gcc and gzip, plots on the left correspond exactly to the inputs
in the Reference set). The two plot sets for each program share the same
abscissae representing the sequence numbers of the various input instances of
that program in the randomly generated sequence of inputs. Each plot comprises
six points connected by a line, each point referring to a different occurrence of
the same input in the sequence.

Dividing the plots for each program in this manner does more than simply
solve the scale problem on the ordinate axis: at least qualitatively (and also
quantitatively, provided one bears in mind the differences in scale between the
left- and right-hand sides), the division highlights the fact that the program's
performance on inputs on the left (those for which running times are larger)
tends to improve noticeably as the sequence of inputs is played, particularly
when the same input occurs in the sequence with little or no occurrence of
other intervening inputs. This is to be contrasted with what happens to the
inputs on the right (the ones for which running times are smaller). With only
a few exceptions, on these inputs the program tends to perform in a relatively
unaltered way, yielding practically the same running time at all encounters with
the same input. Interestingly, the exceptions are precisely those inputs for which
the running times stand out within the sets on the right-hand side, that is, those
that require substantially larger running times than the other inputs in their
sets. This pattern of behavior is, of course, what we aimed at with the design
summarized in Section 3.

^2 We found $\alpha_0 = 0.8$ to be satisfactory regardless of P, and also the
particular value assigned to $T_0$ to be practically immaterial.

Figure 4: (Continued).

Table 2: Performance data for bzip2.

Input    t_last  t_first  g_first     t_O3     g_O3     t_PH     g_PH
0        141.87   148.13     4.23   148.91     4.73   148.11     4.21
1        115.76   119.34     3.00   122.51     5.51   119.47     3.11
2        112.87   118.34     4.62   120.76     6.53   118.03     4.37
3         10.97    12.28    10.67    12.07     9.11    11.98     8.43
4          8.63     9.67    10.75     9.77    11.67     9.63    10.38
5          2.38     2.54     6.30     2.65    10.18     2.59     8.11
6          2.02     2.07     2.42     2.07     2.42     2.02     0.00
7          1.94     1.95     0.51     2.04     4.90     1.97     1.52

Table 3: Performance data for crafty.

Input    t_last  t_first  g_first     t_O3     g_O3     t_PH     g_PH
0        125.03   129.85     3.71   133.25     6.16   126.21     0.94
1         18.06    19.98     9.61    20.12    10.24    19.30     6.42
2          2.88     2.97     3.03     3.08     6.49     2.91     1.03
3          0.56     0.56     0.00     0.58     3.45     0.55    -1.82
4          0.21     0.21     0.00     0.23     8.70     0.21     0.00
5          0.07     0.07     0.00     0.07     0.00     0.07     0.00

In Tables 2 through 11, one for each of the ten programs, we provide data
that help interpret the plots of Figure 4 and also data for comparing our
Bayesian-network approach with the O3 and PH strategies. In each table, the
$t_{\mathrm{last}}$ column gives, for each input, the program's running time on
the last (the sixth) occurrence of that input in the randomly generated sequence
of inputs, while $t_{\mathrm{first}}$ gives the running time on the input's first
occurrence. Similarly, $t_{\mathrm{O3}}$ and $t_{\mathrm{PH}}$ refer to the
running times, respectively, of the O3 and PH versions of the program on that
same input. Note that $t_{\mathrm{first}}$ and $t_{\mathrm{O3}}$ are expected to
be practically the same for the input that happens to be the first in the randomly
generated sequence (for this input, and recalling that all compilations do level-3
optimization, the O3 strategy is indistinguishable from ours).

Table 4: Performance data for gap.

Input    t_last  t_first  g_first     t_O3     g_O3     t_PH     g_PH
0        194.83   200.70     2.92   203.78     4.39   200.98     3.06
1          6.17     6.57     6.09     6.49     4.93     6.41     3.75
2          0.78     0.80     2.50     0.83     6.02     0.79     1.27
3          0.52     0.55     5.45     0.56     7.14     0.53     1.89
4          0.22     0.22     0.00     0.22     0.00     0.24     8.33
5          0.06     0.06     0.00     0.06     0.00     0.06     0.00

Table 5: Performance data for gcc.

Input    t_last  t_first  g_first     t_O3     g_O3     t_PH     g_PH
0         80.15    83.53     4.05    85.89     6.68    83.31     3.79
1         86.43    89.98     3.95    92.82     6.89    89.78     3.73
2          8.78     9.98    12.02    10.27    14.50     8.95     1.90
3         14.73    15.90     7.55    15.92     7.66    14.97     1.80
4         46.98    50.20     6.41    52.02     9.69    50.82     7.56
5          3.01     3.29     8.51     3.29     8.51     3.01     0.00
6          1.16     1.21     4.13     1.31    11.45     1.16     0.00
7          0.32     0.32     0.00     0.35     8.57     0.32     0.00
8          0.06     0.06     0.00     0.06     0.00     0.06     0.00

The remaining three columns in the tables give the final percent gain of the
Bayesian-network approach over the first encounter with each input, and also
over the O3 and PH versions. These three gains are defined, respectively, as
$g_{\mathrm{first}} = 100(t_{\mathrm{first}} - t_{\mathrm{last}})/t_{\mathrm{first}}$,
$g_{\mathrm{O3}} = 100(t_{\mathrm{O3}} - t_{\mathrm{last}})/t_{\mathrm{O3}}$, and
$g_{\mathrm{PH}} = 100(t_{\mathrm{PH}} - t_{\mathrm{last}})/t_{\mathrm{PH}}$.
With very rare exceptions, positive gains dominate the ten tables and indicate a
superior performance of the Bayesian-network approach, by up to 12.61% in the
case that relates to the first execution on the input in question, 14.50% in the
case of O3, and 12.96% in the PH case. The figures for $g_{\mathrm{first}}$
clearly corroborate our conclusions when analyzing Figure 4. Also, gains over
O3 tend to surpass those over PH, once again with rare exceptions.
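The three gain columns follow one formula, which can be checked against a table row with a short sketch. The function name `percent_gain` is hypothetical, and returning 0 for a zero reference time is an assumption suggested by the all-zero rows in the tables.

```python
def percent_gain(t_ref, t_last):
    """Percent gain of the last run over a reference running time."""
    if t_ref == 0:
        return 0.0  # assumption: rows with 0.00 times report a 0.00 gain
    return 100.0 * (t_ref - t_last) / t_ref


# bzip2, first row of Table 2: t_last, t_first, t_O3, t_PH
t_last, t_first, t_o3, t_ph = 141.87, 148.13, 148.91, 148.11
gains = [round(percent_gain(t, t_last), 2) for t in (t_first, t_o3, t_ph)]
```

Here `gains` reproduces the tabulated values 4.23, 4.73, and 4.21 for that row.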

Table 6: Performance data for gzip.

Input    t_last  t_first  g_first     t_O3     g_O3     t_PH     g_PH
0         36.38    39.87     8.75    40.77    10.77    40.13     9.34
1         18.65    19.83     5.95    20.72     9.99    19.89     6.23
2         46.11    49.76     7.34    51.14     9.84    50.06     7.89
3         38.11    40.87     6.75    42.52    10.37    41.34     7.81
4         68.08    70.76     3.79    74.26     8.32    72.23     5.86
5         24.14    25.30     4.58    26.96    10.46    26.56     9.11
6          0.00     0.00     0.00     0.00     0.00     0.00     0.00
7          0.74     0.74     0.00     0.76     2.63     0.74     0.00
8          0.27     0.27     0.00     0.29     6.90     0.27     0.00
9             ?        ?        ?        ?        ?        ?        ?
10         0.68     0.68     0.00     0.70     2.86     0.68     0.00
11         1.23     1.23     0.00     1.27     3.15     1.25     1.60
12         0.70     0.73     4.11     0.73     4.11     0.71     1.41
13         0.28     0.28     0.00     0.30     6.67     0.28     0.00
14         0.54     0.54     0.00     0.57     5.26     0.54     0.00
15         0.68     0.69     1.45     0.72     5.55     0.69     1.44
16         1.01     1.01     0.00     1.05     3.80     1.02     0.98
17         0.69     0.69     0.00     0.71     2.82     0.70     1.43
18         0.29     0.29     0.00     0.30     3.33     0.29     0.00
19         2.23     2.24     0.45     2.32     3.88     2.27     1.76
20         0.68     0.68     0.00     0.71     4.23     0.69     1.44
21         1.41     1.42     0.71     1.49     5.37     1.44     2.08

Table 7: Performance data for mcf.

Input    t_last  t_first  g_first     t_O3     g_O3     t_PH     g_PH
0        767.03   789.98     2.91   789.98     2.91   786.53     2.48
1         66.05    69.23     4.59    69.34     4.74    69.34     4.74
2          0.36     0.37     2.70     0.37     2.70     0.36     0.00
3          1.85     1.85     0.00     1.86     0.54     1.86     0.54
4          0.13     0.13     0.00     0.13     0.00     0.13     0.00

Table 8: Performance data for parser.

Input    t_last  t_first  g_first     t_O3     g_O3     t_PH     g_PH
0        403.08   408.93     1.45   414.80     2.84   408.07     1.24
1          9.06     9.62     5.82     9.80     7.55     9.63     5.92
2          2.94     2.93    -0.34     3.09     4.85     3.01     2.33
3          3.38     3.41     0.88     3.45     2.03     3.41     0.88
4          0.46     0.48     4.17     0.48     4.17     0.46     0.00
5          0.20     0.20     0.00     0.20     0.00     0.20     0.00

Table 9: Performance data for twolf.

Input    t_last  t_first  g_first     t_O3     g_O3     t_PH     g_PH
0        865.97   884.64     2.11   884.64     2.11   875.21     1.06
1         20.53    21.36     3.89    21.76     5.65    21.54     4.69
2          0.20     0.20     0.00     0.21     4.76     0.20     0.00
3          0.71     0.71     0.00     0.72     1.39     0.70    -1.43
4          0.07     0.08    12.50     0.08    12.50     0.08    12.50

Table 10: Performance data for vortex.

Input    t_last  t_first  g_first     t_O3     g_O3     t_PH     g_PH
0        108.78   114.21     4.75   116.00     6.22   114.32     4.85
1         91.81    95.67     4.03    97.50     5.84    95.97     4.33
2        136.02   140.25     3.02   143.29     5.07   141.12     3.61
3          9.29     9.76     4.82    10.13     8.29     9.87     5.88
4          5.39     5.53     2.53     5.77     6.59     5.54     2.71
5          0.61     0.61     0.00     0.65     6.15     0.59    -3.39
6          0.21     0.22     4.55     0.22     4.55     0.21     0.00
7          0.05     0.05     0.00     0.05     0.00     0.05     0.00

Table 11: Performance data for vpr.

Input    t_last  t_first  g_first     t_O3     g_O3     t_PH     g_PH
0        226.91   233.47     2.81   240.89     2.81   237.21     4.34
1        206.21   211.91     2.69   225.39     8.51   212.52     2.97
2         10.88    12.45    12.61    12.45    12.61    12.50    12.96
3         18.46    19.58     5.72    21.05    12.30    19.76     6.58
4          1.19     1.18    -0.85     1.24     4.03     1.22     2.46
5          0.73     0.74     1.35     0.77     5.19     0.74     1.35
6          0.17     0.17     0.00     0.18     5.56     0.17     0.00
7          0.00     0.00     0.00     0.00     0.00     0.00     0.00
8          0.02     0.02     0.00     0.01  -100.00     0.01  -100.00
9          0.00     0.00     0.00     0.00     0.00     0.00     0.00

5 Concluding remarks

We have in this paper introduced a new approach to improving the usage of
the instruction memory. Our approach is probabilistic in nature and has two
main ingredients. The first ingredient is what we call the execution model and
concerns each individual execution of a given program on a given input. The
execution model is a Bayesian network whose node labels are built from a trace
recorded as the program is run on the input. The second ingredient is what we
call the history model. It too is a Bayesian network, one that is now focused
on a given program as it runs on a varied stream of inputs: after the program
is run on a new input, the resulting execution model is incorporated into the
history model by a technique that updates node labels to the geometric average
of two corresponding labels, one from the current history model, the other from
the new execution model.

The effect of this continual updating of the history model by data resulting
from running the program on new inputs is that, for each new input, the actual
code to be executed can take into account all the knowledge stored in the history
model. The way this is achieved is by reordering basic blocks for use on the
next input whenever the history model gets updated. This reordering is done in
such a way that the program's basic blocks that according to the history model
are the most likely to be executed are the ones to occupy the highest levels of
the instruction-memory hierarchy.

Incorporating an execution model into the history model can be achieved
in various ways, even if we restrict ourselves to using the geometric-average
criterion. The particular way we have chosen in this paper has been to select
weights for the geometric average that let comparatively longer executions
influence the history model more heavily than comparatively shorter ones. Our
results on selected SPEC programs demonstrate the efficacy of the history model
in improving the running times of programs on precisely those inputs for which
such times are longer. They also demonstrate, for the majority of the cases
we investigated, that running times tend to become better by a non-negligible
margin than those obtained by O3 or PH optimization.
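A weighted geometric average of two node labels can be sketched as follows. The weight is taken here as a plain argument, since the functional form the paper uses to select it is not reproduced in this section; the function name `geometric_update` and its parameter names are hypothetical.

```python
def geometric_update(history_label, execution_label, weight):
    """Blend two probability labels by a weighted geometric average.

    weight -- share given to the new execution model, between 0 and 1;
              a longer execution would be assigned a larger weight.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    for label in (history_label, execution_label):
        if not 0.0 <= label <= 1.0:
            raise ValueError("labels must be probabilities")
    return (history_label ** (1.0 - weight)) * (execution_label ** weight)
```

A weight of 0 leaves the history label unchanged, a weight of 1 replaces it by the execution label, and intermediate weights move the result between the two.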

As becomes apparent from Sections 2 through 4, maintaining a Bayesian-network
history model for a program as it is run on a sequence of inputs depends
on strategy and parameter choices that are not necessarily unique. This is true
of our choice of a geometric average to combine two Bayesian networks, and
also of the functional form of (15) to select weights for the geometric average. It
is similarly true of our choice of method for solving the history model (stochastic
simulation coupled with simulated annealing) and of the parameters involved.

But however arbitrary some of these choices are, running an instrumented
version of the program for trace recording and solving the history model are
costly procedures, so one naturally wonders about the practicality of the overall
approach in a real-world context. Our vision here is that the strategy
summarized in Steps 1 and 2 is not to be applied to the sequence of
all the inputs that come along for execution by program P, but rather to a
subsequence of that sequence, for example as follows.

When input $I_i$ arrives, two instances of P are started on it. The first instance
is not instrumented and returns the result of the execution as soon as it becomes 
available. The second instance, in turn, is instrumented and yields an execution 
model to be incorporated into the history model for P. New inputs that appear 
in the meantime only cause one instance of P to be started (the one that is not 
instrumented). Once the history model for P has been updated and solved, and 
a corresponding reordered code has been obtained, a new input for P may then 
once again trigger two executions of P, but now employing the newly reordered 
code. In this vision, a background system can be dedicated to maintaining 
history models and from time to time releasing versions of crucial programs 
that are tuned to the types of demand they have encountered. 